DIP - Python tutorials for image processing and machine learning (50-54) - Unsupervised Learning
Notes based on the YouTube channel DigitalSreeni.
50 - What is k-means clustering and how to code it in Python
| | X | Y |
| --- | --- | --- |
| 0 | 1 | 42 |
| 1 | 2 | 46 |
| 2 | 3 | 51 |
| 3 | 4 | 20 |
| 4 | 5 | 30 |

(Output: a scatter plot with 'X' on the x-axis and 'Y' on the y-axis.)
We know that the choice of initial centers strongly affects the result, so improving the initialization is an important topic. Among the improved algorithms, K-means++ is the best known.
The K-means++ steps are as follows:
- Pick one center uniformly at random;
- For every point, compute its distance D(x) to the nearest already-chosen center, and pick a new center with probability proportional to D(x)²;
- Repeat the second step until k centers are chosen.
In short, K-means++ tends to pick points far away from the centers already chosen, which matches intuition: cluster centers should be as far apart from each other as possible.
The drawback of this algorithm is that it is hard to parallelize. K-means|| therefore changes the sampling strategy: instead of sampling a single point per pass as K-means++ does, each pass samples several points at once, and the pass is repeated O(log n) times, yielding a candidate set from which the final k centers are chosen. In practice one does not need the full O(log n) passes; about 5 passes are enough.
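As a quick illustration, here is a minimal sklearn sketch on the small X/Y DataFrame above (n_clusters=2 and the random seed are my own choices, not necessarily the video's):

```python
import pandas as pd
from sklearn.cluster import KMeans

# The small demo DataFrame shown above.
df = pd.DataFrame({"X": [1, 2, 3, 4, 5], "Y": [42, 46, 51, 20, 30]})

# init="k-means++" uses the initialization described above;
# n_init repeats the whole run with different seeds and keeps the best result.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
df["cluster"] = kmeans.fit_predict(df[["X", "Y"]])

print(df)
print(kmeans.cluster_centers_)
```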
51 - Image Segmentation using K-means
- Reshape the image into a one-dimensional array of pixels.
Now we apply the KMeans function. Before that, we need to specify the criteria. My criteria is such that whenever 10 iterations of the algorithm have run, or an accuracy of epsilon = 1.0 is reached, the algorithm stops and returns the answer. (A call sketch follows the parameter list below.)
- criteria: the iteration termination criteria. When this criteria is satisfied, the algorithm stops iterating. It is a tuple of 3 parameters: ( type, max_iter, epsilon )
- type: the type of termination criteria. It has 3 flags, as below:
  - cv.TERM_CRITERIA_EPS - stop the algorithm iteration if the specified accuracy, epsilon, is reached.
  - cv.TERM_CRITERIA_MAX_ITER - stop the algorithm after the specified number of iterations, max_iter.
  - cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER - stop the iteration when either of the above conditions is met.
- max_iter: an integer specifying the maximum number of iterations.
- epsilon: the required accuracy.
- Clusters
- attempts: a flag to specify the number of times the algorithm is executed using different initial labellings. The algorithm returns the labels that yield the best compactness. This compactness is returned as output.
- flags: this flag is used to specify how initial centers are taken. Normally two flags are used for this: cv.KMEANS_PP_CENTERS and cv.KMEANS_RANDOM_CENTERS.
- compactness: the sum of squared distances from each point to its corresponding center.
- labels: the label array (same as 'code' in the previous article) where each element is marked '0', '1'…
- centers: the array of cluster centers.
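Putting the parameters above together, a minimal call sketch (the pixel data here is a random stand-in, not the tutorial's image):

```python
import cv2
import numpy as np

# Stand-in pixel data: cv2.kmeans requires float32 input of shape (M, features).
pixels = np.float32(np.random.randint(0, 256, (1000, 3)))

# Stop after 10 iterations or once the required accuracy epsilon = 1.0 is reached.
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

# K=3 clusters, 10 attempts with different random initial centers.
compactness, labels, centers = cv2.kmeans(
    pixels, 3, None, criteria, 10, cv2.KMEANS_RANDOM_CENTERS)
```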
- Reshape the array back to the original image shape.
- In the end, the image is reduced to only 3 colors.
- Color quantization is the process of reducing the number of colors in an image. One reason to do so is to reduce memory use. Some devices are limited such that they can only produce a limited number of colors; in those cases, color quantization is also performed. Here we use k-means clustering for color quantization.
- There is nothing new to explain here. There are 3 features, say R, G, B. So we need to reshape the image to an array of Mx3 size (M is the number of pixels in the image). After the clustering, we apply the centroid values (also R, G, B) to all pixels, so that the resulting image has the specified number of colors. Then we reshape it back to the shape of the original image, as sketched below.
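A sketch of the full quantization pipeline with OpenCV (the file name "image.jpg" is hypothetical; k=3 matches the 3-color result below):

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg")                  # hypothetical input image (BGR)
pixel_vals = np.float32(img.reshape((-1, 3)))  # M x 3, one row per pixel

criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
k = 3                                          # colors in the output image
_, labels, centers = cv2.kmeans(pixel_vals, k, None, criteria, 10,
                                cv2.KMEANS_RANDOM_CENTERS)

centers = np.uint8(centers)                    # the k centroid colors
quantized = centers[labels.flatten()]          # map every pixel to its center
quantized = quantized.reshape(img.shape)       # back to the original shape
```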
The 3 cluster centers (the quantized colors):

array([[251, 251, 251],
       [151, 151, 151],
       [ 47,  47,  47]], dtype=uint8)
52 - What is GMM and how to use it for Image segmentation
- The image is segmented into two classes, 0 and 1.
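A minimal sketch of GMM-based segmentation with sklearn (the file name and n_components=2 are my assumptions, matching the two-class result above):

```python
import cv2
from sklearn.mixture import GaussianMixture

img = cv2.imread("image.jpg", 0)   # hypothetical grayscale input
pixels = img.reshape((-1, 1))      # one feature (intensity) per pixel

# Fit a 2-component mixture; each pixel is then labeled 0 or 1.
gmm = GaussianMixture(n_components=2, random_state=0).fit(pixels)
segmented = gmm.predict(pixels).reshape(img.shape)
```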
52b - Understanding Gaussian Mixture Model (GMM) using 1D, 2D, and 3D examples
- Demonstration of GMM on 1D, 2D, and 3D data.
For 1D
- First we generate data by sampling random data from two normal distributions.
- Then we decompose it into 3 (or a different number of) Gaussians.
- Finally, we plot the original data and the decomposed Gaussians.
- Do something similar for the 2D and 3D cases…
- Generate data, perform GMM, and plot the individual components, as sketched below.
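A 1D sketch of those steps (the means, standard deviations, and sample sizes are my own stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.mixture import GaussianMixture

# Sample from two normal distributions with different means and std. devs.
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 500), rng.normal(6, 2, 500)])

# Decompose into 3 Gaussians (deliberately more than the true 2).
gmm = GaussianMixture(n_components=3, random_state=0).fit(data.reshape(-1, 1))

# Plot the data histogram and each weighted Gaussian component.
x = np.linspace(data.min(), data.max(), 500)
plt.hist(data, bins=50, density=True, alpha=0.5)
for mean, cov, w in zip(gmm.means_.ravel(), gmm.covariances_.ravel(), gmm.weights_):
    plt.plot(x, w * stats.norm.pdf(x, mean, np.sqrt(cov)))
plt.show()
```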
- Create some data: draw samples from normal distributions with different means and standard deviations, so that the data is well suited to demonstrating GMM.
- Concatenate the samples to create a single data set.
- We created data from two normal distributions, but for the fun of it let us decompose the data into 3 Gaussians: n_components=3.
2D example
- Generate some data
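A 2D sketch along the same lines (the blob centers and covariances are my stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Two 2D Gaussian blobs with different means and covariances.
rng = np.random.default_rng(0)
a = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], 300)
b = rng.multivariate_normal([5, 5], [[1, -0.7], [-0.7, 2]], 300)
data = np.vstack([a, b])

# Fit a 2-component GMM and color points by their predicted component.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
plt.scatter(data[:, 0], data[:, 1], c=gmm.predict(data), s=10)
plt.scatter(*gmm.means_.T, c="red", marker="x")  # fitted component means
plt.show()
```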
3D
- Generate 3D data with 4 clusters: set the Gaussian centers and covariances in 3D.
- Plot
- Fit the Gaussian model.
- Functions to visualize the data (see the combined sketch below).
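Covering the steps above in one 3D sketch (the centers and covariances are my stand-ins):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# Generate 3D data: 4 Gaussian clusters with chosen centers and covariances.
rng = np.random.default_rng(0)
centers = [(0, 0, 0), (5, 5, 0), (0, 5, 5), (5, 0, 5)]
data = np.vstack([rng.multivariate_normal(c, np.eye(3) * 0.5, 200)
                  for c in centers])

# Fit a 4-component GMM.
gmm = GaussianMixture(n_components=4, random_state=0).fit(data)

# Visualize: 3D scatter colored by the predicted component.
ax = plt.figure().add_subplot(projection="3d")
ax.scatter(data[:, 0], data[:, 1], data[:, 2], c=gmm.predict(data), s=5)
plt.show()
```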
53 - How to pick optimal number of parameters for your unsupervised machine learning model
(Output: a plot with 'n_components' on the x-axis.)
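A minimal sketch of the model-selection sweep, assuming the criterion is the Bayesian Information Criterion (BIC) of a Gaussian mixture fitted to the image pixels (the file name and component range are my assumptions):

```python
import cv2
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

img = cv2.imread("image.jpg", 0)   # hypothetical grayscale input
pixels = img.reshape((-1, 1))

# Fit GMMs with 1..7 components and record the BIC of each (lower is better).
n_values = list(range(1, 8))
bic = [GaussianMixture(n_components=n, random_state=0).fit(pixels).bic(pixels)
       for n in n_values]

plt.plot(n_values, bic, marker="o")
plt.xlabel("n_components")
plt.ylabel("BIC")
plt.show()
```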
So n_components = 2 is the best choice.
54 - Unsupervised and supervised machine learning - a reminder
A comparison of unsupervised and supervised machine learning.